Non-linear memory layout transformations and data prefetching techniques to exploit locality of references for modern microprocessor architectures with multilayered memory hierarchies PHD THESIS

Author

  • Evangelia G. Athanasaki
Abstract

One of the key challenges facing computer architects and compiler writers is the increasing discrepancy between processor cycle times and main memory access times. To overcome this problem, program transformations that decrease cache misses are used to reduce the average latency of memory accesses. Tiling is a widely used loop iteration reordering technique for improving locality of references. Tiled codes modify the instruction stream to exploit cache locality for array accesses. This thesis adds some intuition and some practical solutions to the well-studied memory hierarchy problem. We further reduce cache misses by restructuring the memory layout of multidimensional arrays that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to translate multidimensional array indices into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to Morton-order indexing. Experimental and simulation results illustrate that execution time is greatly improved when tiled code is combined with tiled array layouts and binary mask-based index translation functions. The stability of the achieved performance improvements is heavily dependent on the appropriate selection of tile sizes, taking into account the actual layout of the arrays in memory. This thesis provides a theoretical analysis of the cache and TLB performance of blocked data layouts. According to this analysis, the optimal tile size that maximizes L1 cache utilization should fit completely in the L1 cache, to avoid any interference misses.
We prove that when applying optimization techniques, such as register assignment, array alignment, prefetching and loop unrolling, tile sizes equal to the L1 capacity offer better cache utilization, even for loop bodies that access more than just one array. Increased self- and/or cross-interference misses are now tolerated through prefetching. Such larger tiles also reduce the CPU cycles lost, because fewer branches are mispredicted.
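
The index translation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the function names (`dilate`, `morton_index`, `dilated_increment`, `blocked_index`), the 16-bit index width, and the restriction to square arrays whose size and tile size are powers of two are all assumptions made for illustration. The sketch shows Morton-order indexing built from dilated integers, a mask-only increment on a dilated integer, and a blocked (tile-by-tile) offset computed with shifts and masks.

```python
# Illustrative sketch only; names and parameters are hypothetical, not from the thesis.

def dilate(x: int, bits: int = 16) -> int:
    """Spread the low `bits` bits of x so that bit i moves to bit 2*i."""
    result = 0
    for i in range(bits):
        result |= ((x >> i) & 1) << (2 * i)
    return result

# Mask selecting the even bit positions used by a dilated 16-bit integer.
EVEN_MASK = dilate(0xFFFF)

def morton_index(row: int, col: int) -> int:
    """Interleave row bits (odd positions) with column bits (even positions)."""
    return (dilate(row) << 1) | dilate(col)

def dilated_increment(d: int, mask: int = EVEN_MASK) -> int:
    """Add 1 to a dilated integer without un-dilating it.

    Setting all non-mask bits lets the carry ripple across the gaps;
    the final AND clears them again.
    """
    return ((d | ~mask) + 1) & mask

def blocked_index(i: int, j: int, n: int, log_t: int) -> int:
    """Offset of element (i, j) in an n x n array stored tile by tile.

    Assumes n is a multiple of the tile size T = 2**log_t. Tiles are laid
    out in row-major order, and elements are row-major inside each tile.
    """
    t_mask = (1 << log_t) - 1
    tiles_per_row = n >> log_t
    tile_id = (i >> log_t) * tiles_per_row + (j >> log_t)
    return (tile_id << (2 * log_t)) | ((i & t_mask) << log_t) | (j & t_mask)
```

With this layout, a tiled loop nest that visits tiles in row-major order and sweeps each tile row by row touches consecutive memory offsets, which is the property the abstract attributes to blocked layouts.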

Similar resources

Data locality optimizations for multigrid methods on structured grids

Besides traditional direct solvers, iterative methods offer an efficient alternative for the solution of systems of linear equations which arise in the solution of partial differential equations (PDEs). Among them, multigrid algorithms belong to the most efficient methods based on the number of operations required to achieve a good approximation of the solution. The relevance of the number of ari...

A Graph Based Framework to Detect Optimal Memory Layouts for Improving Data Locality

In order to extract high levels of performance from modern parallel architectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (ar...

Optimal loop scheduling for hiding memory latency based on two-level partitioning and prefetching

The large latency of memory accesses in modern computers is a key obstacle in achieving high processor utilization. As a result, a variety of techniques have been devised to hide this latency. These techniques range from cache hierarchies to various prefetching and memory management techniques for manipulating the data present in the caches. In DSP applications, the existence of large numbers o...

Exploiting Data Locality in Adaptive Architectures

The speed of processors increases much faster than the memory access time. This makes memory accesses expensive. To meet this problem, cache hierarchies are introduced to serve the processor with data. However, the effectiveness of caches depends on the amount of locality in the application’s memory access pattern. The behavior of various programs differs greatly in terms of cache miss characte...

A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iterat...

Publication date: 2006